06 - Object detection - how to train own model
Robotics I
Poznan University of Technology, Institute of Robotics and Machine Intelligence
Laboratory 6: Object detection - how to train own model
Goals
The objectives of this laboratory are to:
- Understand the basics of object detection
- Learn how to train your own object detection model
Resources
- Good overview of the object detection task
- The benchmark of the best object detection models
- Ultralytics documentation
- YOLOv1 to YOLOv10: A comprehensive review of YOLO variants
How to define an object detection task?
Source: Top 10 Object Detection Models in 2023!
Object detection is a computer vision task that involves localizing one or more objects within an image and classifying each of them. The goal is to find the bounding box (rectangle) coordinates of each object along with its class label.
Usually, object detection bounding boxes are defined in one of the following formats:
- using the left-top corner (x1, y1) and right-bottom corner (x2, y2) coordinates: (x1, y1, x2, y2)
- using the left-top corner (x, y) and the width and height: (x, y, w, h)
- using the center (x, y) and the width and height: (x, y, w, h)

Additionally, object detection results contain:
- confidence score: a value that represents the probability that the detected object exists in the bounding box (“objectness score”)
- class label: a label that represents the class of the detected object, usually represented as an integer value or a list of class probabilities
Note: Be careful with the bounding box format used in the dataset you are working with. If you are not sure, check the dataset documentation or visualize the bounding boxes to understand the format.
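All three formats carry the same information, so converting between them is simple arithmetic. Below is a minimal sketch (with hypothetical helper names) of such conversions; detection libraries usually ship their own utilities for this.

def xyxy_to_xywh(x1, y1, x2, y2):
    # corner format (x1, y1, x2, y2) -> top-left corner + size (x, y, w, h)
    return x1, y1, x2 - x1, y2 - y1

def xyxy_to_cxcywh(x1, y1, x2, y2):
    # corner format (x1, y1, x2, y2) -> center + size (x, y, w, h)
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h

def cxcywh_to_xyxy(cx, cy, w, h):
    # center + size (x, y, w, h) -> corner format (x1, y1, x2, y2)
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

# the same box expressed in all three conventions
print(xyxy_to_xywh(10, 20, 110, 220))         # (10, 20, 100, 200)
print(xyxy_to_cxcywh(10, 20, 110, 220))       # (60.0, 120.0, 100, 200)
print(cxcywh_to_xyxy(60.0, 120.0, 100, 200))  # (10.0, 20.0, 110.0, 220.0)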
Object detection architectures
Source: Semantic Image Cropping
Object detection architectures can be divided into two main categories:
Two-stage detectors: These detectors first generate region proposals and then classify the regions. Examples of two-stage detectors are Faster R-CNN, R-FCN, and FPN.
One-stage detectors: These detectors directly predict the bounding boxes and class probabilities. Examples of one-stage detectors are YOLO, SSD, and RetinaNet.
Usually, two-stage detectors are more accurate but slower than one-stage detectors. The choice of the architecture depends on the application requirements, such as speed and accuracy. In robotics, the choice of the architecture depends on the robot’s computational resources and the task requirements. Still, one-stage detectors are usually preferred due to their speed and ability to run in real time.
YOLO (You Only Look Once)
One of the most popular one-stage object detection architectures, especially in robotics and real-time applications, is YOLO (You Only Look Once). YOLO is a series of fast and accurate object detection models. The first version of YOLO was introduced in 2016, and since then, several versions have been released, evolving the architecture and improving performance.
Source: The AiEdge+: Let’s Make Computer Vision Great Again!
The model is a simple convolutional network whose last convolution layer outputs a tensor with the dimensionality of the target to predict. For each grid cell, the model predicts whether an object is present (i.e., whether the cell contains the center of a bounding box), the probability of each class, and the position and dimensions of the resulting bounding box, once for each of the priors.
Because the model will likely predict multiple bounding boxes for the same object, it is necessary to keep only the best ones. The idea is to pick the box with the highest confidence score, measure its intersection over union (IoU) with all other overlapping boxes of the same class, and remove those whose IoU exceeds a certain threshold. This procedure is called non-maximum suppression (NMS); it ensures that highly overlapping duplicate boxes are reduced to a single detection. A minimal sketch of the idea is shown below.
Source: The AiEdge+: Let’s Make Computer Vision Great Again!
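The following is a framework-free sketch of greedy NMS, assuming boxes in (x1, y1, x2, y2) format and a single class (in practice, NMS is applied separately per class). Ultralytics models run this step internally, so you will not have to implement it in this laboratory.

def iou(a, b):
    # intersection over union of two boxes in (x1, y1, x2, y2) format
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    # greedy non-maximum suppression: keep the most confident box,
    # drop every remaining box that overlaps it by more than the threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# two strongly overlapping detections of the same object -> only one survives
boxes = [(10, 10, 100, 100), (12, 14, 98, 102), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, 0.5))  # [0, 2]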
Note: Everything we do today should be done inside the container!
💥 💥 💥 Task 💥 💥 💥
In this task, you will train your own object detection model using the YOLO architecture.
Requirements
A graphics processing unit (GPU) is required to train the object detection model. If you don’t have an NVIDIA GPU, you can use the CPU version of the container, but the training process will be very slow.
Get the ros2_detection_gpu or ros2_detection_cpu image:

Note: Before you start downloading or building the image, check docker images to see if it has already been downloaded.

- Option 1 - Download the GPU version or download the CPU version. Load the docker image with docker load < ros2_detection_gpu.tar / docker load < ros2_detection_cpu.tar.
- Option 2 - Build it from the source using the repository.
Run the ros2_detection container (GPU or CPU version):
docker_run_detection_gpu.sh
IMAGE_NAME="ros2_detection_gpu:latest" CONTAINER_NAME="" # student ID number xhost +local:root XAUTH=/tmp/.docker.xauth if [ ! -f $XAUTH ] then xauth_list=$(xauth nlist :0 | sed -e 's/^..../ffff/') if [ ! -z "$xauth_list" ] then echo $xauth_list | xauth -f $XAUTH nmerge - else touch $XAUTH fi chmod a+r $XAUTH fi docker stop $CONTAINER_NAME || true && docker rm $CONTAINER_NAME || true docker run -it \ --env="DISPLAY=$DISPLAY" \ --env="QT_X11_NO_MITSHM=1" \ --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \ --env="XAUTHORITY=$XAUTH" \ --volume="$XAUTH:$XAUTH" \ --privileged \ --network=host \ --gpus all \ --env="NVIDIA_VISIBLE_DEVICES=all" \ --env="NVIDIA_DRIVER_CAPABILITIES=all" \ --shm-size=1024m \ --name="$CONTAINER_NAME" \ $IMAGE_NAME \ bash
docker_run_detection_cpu.sh
IMAGE_NAME="ros2_detection_cpu:latest" CONTAINER_NAME="123" # student ID number xhost +local:root XAUTH=/tmp/.docker.xauth if [ ! -f $XAUTH ] then xauth_list=$(xauth nlist :0 | sed -e 's/^..../ffff/') if [ ! -z "$xauth_list" ] then echo $xauth_list | xauth -f $XAUTH nmerge - else touch $XAUTH fi chmod a+r $XAUTH fi docker stop $CONTAINER_NAME || true && docker rm $CONTAINER_NAME || true docker run -it \ --env="DISPLAY=$DISPLAY" \ --env="QT_X11_NO_MITSHM=1" \ --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \ --env="XAUTHORITY=$XAUTH" \ --volume="$XAUTH:$XAUTH" \ --privileged \ --network=host \ --shm-size=1024m \ --name="$CONTAINER_NAME" \ $IMAGE_NAME \ bash
Preparation
This container is prepared for two laboratories. For today’s laboratory, go to the model_training directory, where all the necessary scripts are located.

Download the dataset. This script will download the validation subset of the COCO dataset (about 5000 images). The whole COCO dataset contains 118k images; therefore, we will use a subset of it to speed up the training process during the laboratory.
bash scripts/01_download_dataset.bash
The script generates the datasets directory with the following structure:
datasets/
├── coco_val2017
│ ├── images
│ │ ├── 000000000139.jpg
│ │ ├── ...
│ │ └── 000000581781.jpg
│ └── labels
│ ├── 000000000139.txt
│ ├── ...
│ └── 000000581781.txt
Check out the sample labels. The labels are in the YOLO format, where the first column is the class index, and the following four columns are the bounding box coordinates in the format:
class x_center y_center width height
Note: Box coordinates are normalized to the image width and height, so they are in the range [0, 1].
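To make the format concrete, here is a small sketch (the file name and image size below are only illustrative) that reads a YOLO-format label file and converts each box from normalized center coordinates to pixel corner coordinates:

def load_yolo_labels(label_path, img_w, img_h):
    # read one YOLO-format label file and convert each entry from normalized
    # (class, x_center, y_center, width, height) to pixel (class, x1, y1, x2, y2)
    boxes = []
    with open(label_path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            xc, w = float(xc) * img_w, float(w) * img_w
            yc, h = float(yc) * img_h, float(h) * img_h
            boxes.append((int(cls), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2))
    return boxes

# illustrative call -- substitute a real label file and the matching image size
print(load_yolo_labels("datasets/coco_val2017/labels/000000000139.txt", 640, 480))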
- The original COCO dataset contains 80 classes, which is too many for our purposes. Therefore, we will filter the dataset to keep labels only for the following classes: ‘person’, ‘bicycle’, ‘car’, ‘motorcycle’, ‘bus’, and ‘truck’.
Check the scripts/02_filter_labels.py script to see the instructions. When you have filled in all the gaps in the script, run it:

python3 scripts/02_filter_labels.py
Validate an example label to see if the script works correctly.
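The filtering logic itself is what you complete in scripts/02_filter_labels.py; the snippet below is only a rough sketch of one possible approach. The COCO class indices used (person=0, bicycle=1, car=2, motorcycle=3, bus=5, truck=7), the remapping to contiguous indices, and the in-place overwrite are assumptions and may differ from the actual script.

import glob

# assumed COCO 80-class indices for the kept classes, remapped to 0..5
KEEP = {0: 0, 1: 1, 2: 2, 3: 3, 5: 4, 7: 5}

for label_file in glob.glob("datasets/coco_val2017/labels/*.txt"):
    with open(label_file) as f:
        lines = f.readlines()
    kept = []
    for line in lines:
        parts = line.split()
        if int(parts[0]) in KEEP:
            # keep the box, rewriting the class index to the new contiguous id
            kept.append(" ".join([str(KEEP[int(parts[0])])] + parts[1:]) + "\n")
    with open(label_file, "w") as f:  # overwrite the label file with the filtered boxes
        f.writelines(kept)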
- As we use the validation subset of the COCO dataset, we need to split it into training and validation sets manually. Check the scripts/03_split_dataset.py script to see the instructions. When you have filled in all the gaps in the script, run it:

python3 scripts/03_split_dataset.py

It generates the train_list.txt and val_list.txt files in the datasets directory. Check the files to see if the script works correctly.
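Again, the graded logic belongs in scripts/03_split_dataset.py; the sketch below only illustrates the idea of a random split. The 80/20 ratio, the fixed random seed, and the output paths are assumptions for illustration.

import glob
import random

images = sorted(glob.glob("datasets/coco_val2017/images/*.jpg"))
random.seed(0)        # fixed seed so the split is reproducible
random.shuffle(images)

split = int(0.8 * len(images))  # assumed 80% train / 20% val split
with open("datasets/train_list.txt", "w") as f:
    f.write("\n".join(images[:split]) + "\n")
with open("datasets/val_list.txt", "w") as f:
    f.write("\n".join(images[split:]) + "\n")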
Note: Validate the paths and classes in the configs/coco128_filtered.yaml file. It will be used in the next steps.
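For reference, an Ultralytics dataset config typically lists the dataset root, the train/val image lists, and the class names. The sketch below is only illustrative (its paths and class ordering are assumptions); compare it against the actual configs/coco128_filtered.yaml rather than copying it.

path: datasets            # dataset root directory (assumed)
train: train_list.txt     # training image list generated in the previous step (assumed location)
val: val_list.txt         # validation image list generated in the previous step (assumed location)
names:
  0: person
  1: bicycle
  2: car
  3: motorcycle
  4: bus
  5: truck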
Training and validation
Note: All of the following scripts use a GPU. If you don’t have a GPU, you can use the device=cpu parameter with every command. Using neural networks on a CPU is very slow, so be patient or reduce parameters like imgsz or batch.
- Train the YOLO model using the following command. It uses the pre-trained yolo11n.pt weights, which should improve the training results and speed up the process. You can find all training parameter definitions here.
yolo detect train data=configs/coco128_filtered.yaml model=yolo11n.pt epochs=20 imgsz=512 batch=16
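The same training run can also be launched from Python through the Ultralytics API, which is convenient when you want to script the process. A short sketch equivalent to the command above:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # start from the pre-trained weights
model.train(
    data="configs/coco128_filtered.yaml",
    epochs=20,
    imgsz=512,
    batch=16,
    # device="cpu",         # uncomment if you have no GPU
)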
During the training process, you can check the runs/detect/ directory to see the training progress and the generated plots.
- When the training process is finished, you can evaluate the model on the validation dataset. For laboratory purposes, we ignore the testing subset and evaluate only on the validation subset.
yolo val data=configs/coco128_filtered.yaml imgsz=640 batch=16 conf=0.25 iou=0.6 split=val model=<PATH_TO_BEST_MODEL.PT>
Save the evaluation results to the results.txt file.
- In the last step, you can run the model on your own data. By specifying the source parameter, you can use an image, video, or camera stream: a file on your disk or a URL to an image or video.
yolo predict data=configs/coco128_filtered.yaml imgsz=640 model=<PATH_TO_BEST_MODEL.PT> source="<URL>"
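If you prefer to inspect the detections programmatically, prediction can also be run from Python. In the sketch below, the checkpoint path and source are placeholders you need to adjust:

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # adjust to your best checkpoint
results = model.predict(source="path/or/url/to/image.jpg", imgsz=640, conf=0.25)

for r in results:
    for box in r.boxes:
        cls_id = int(box.cls[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{model.names[cls_id]} {float(box.conf[0]):.2f} "
              f"({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")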
💥 💥 💥 Assignment 💥 💥 💥
To pass the course, you need to upload the following files to the eKursy platform:
- best.pt - the best model weights (you can use it for inference in the next laboratories instead of the example weights)
- results.txt - a text file with the results of the model evaluation (the output of the yolo val command)